Problem Definition

The goal of this project is to analyze the yearly mean sunspot numbers dataset from 1700 to 2023, obtained from the Solar Influences Data Analysis Center of the Royal Observatory of Belgium (available here). This dataset represents the yearly mean total sunspot number, calculated as the arithmetic mean of the daily total sunspot number for each year.

The project aims to provide insights into trends and develop a reliable model for the data. Sunspots, which are temporary phenomena on the Sun’s photosphere appearing as darker spots, are crucial for understanding solar cycles with implications for space weather and climate. Understanding their patterns and periodicity is essential for these analyses.

Aim

The primary objectives are as follows:

  1. Conduct an initial exploratory data analysis to understand the characteristics, trends, and patterns present in the dataset. This analysis will involve examining summary statistics, visualizations such as time series plots, and identifying any notable features or anomalies.

  2. Propose a set of possible models using various model specification tools such as Autocorrelation Function (ACF), Partial Autocorrelation Function (PACF), Extended Autocorrelation Function (EACF), and Bayesian Information Criterion (BIC) table.

  3. Fit all the proposed models to the dataset to obtain parameter estimates. This step involves interpreting the estimated coefficients and assessing their significance.

  4. Use goodness-of-fit metrics such as the Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), and Mean Squared Error (MSE) to compare the proposed models, select the best-fitting one, and use that model to forecast the next 10 years.
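As a minimal sketch of the comparison step in objective 4 (simulated data and placeholder orders, not the models selected later in this report):

```r
# Hypothetical sketch: ranking candidate ARIMA specifications by AIC/BIC.
# The data and candidate orders below are illustrative only.
set.seed(42)
x <- arima.sim(model = list(ar = 0.6), n = 200)

candidates <- list(c(1, 0, 0), c(2, 0, 0), c(1, 0, 1))
scores <- t(sapply(candidates, function(ord) {
  fit <- arima(x, order = ord)
  c(AIC = AIC(fit), BIC = BIC(fit))
}))
rownames(scores) <- sapply(candidates, paste, collapse = ",")

# Lower is better for both criteria
scores[order(scores[, "AIC"]), ]
```

Both criteria penalize model complexity, with BIC penalizing extra parameters more heavily, so they may disagree; when they do, forecast-oriented error metrics such as MSE help break the tie.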

Setup

This section performs the setup steps required before the analysis, including setting the working directory and importing the packages used for data manipulation, visualization, and time series analysis. This preparation is essential for a smooth and effective analysis process.

Setting Working Directory

# Add user working directory path.
setwd("~/Documents/RMIT/TimeSeriesAnalysis/TSA-A3")

# Verify present working directory.
getwd()
## [1] "/Users/chris/Documents/RMIT/TimeSeriesAnalysis/TSA-A3"

Dependencies

# Code to install dependencies.
# install.packages(c('tidyverse', 'TSA', 'forecast', 'lmtest', 'tseries', 'urca'))

# Import Dependencies.
library(tidyverse)
library(TSA)
library(forecast)
library(lmtest)
library(urca)
library(tseries)
library(patchwork)

# Source Utility Script
source("./utilities.R")

Details on the functions used from the Utility Script (utilities.R) can be found in the Appendix section of this report.

Data

The data set under analysis in this assignment represents yearly mean Sunspot Numbers, calculated as the arithmetic mean of the daily total sunspot number for each year from 1700 to 2023. Sourced from the Solar Influences Data Analysis Center of the Royal Observatory of Belgium (available here), the data set offers insights into long-term solar activity trends and variations.

# Reading Data into R environment.
sunspot_data <- read.csv('./Resources/YearlySunspotData.csv', sep = ';', header = FALSE)

# Extract the first two columns
sunspot_data <- sunspot_data[, 1:2]

# Assign column names
colnames(sunspot_data) <- c("Year","MeanSunspotNumber")

# Display Sunspot Data
sunspot_data %>% head(15)
##      Year MeanSunspotNumber
## 1  1700.5               8.3
## 2  1701.5              18.3
## 3  1702.5              26.7
## 4  1703.5              38.3
## 5  1704.5              60.0
## 6  1705.5              96.7
## 7  1706.5              48.3
## 8  1707.5              33.3
## 9  1708.5              16.7
## 10 1709.5              13.3
## 11 1710.5               5.0
## 12 1711.5               0.0
## 13 1712.5               0.0
## 14 1713.5               3.3
## 15 1714.5              18.3

The data set consists of observations from 1700 to 2023 (324 years, inclusive of the start and end years). It is essential to check that all years are accounted for and that there are no inconsistent or null values, as these would indicate an incomplete sequence.

# Calculate number of years between 1700 - 2023 (324 years)
true_years <- (2023 - 1700) + 1

# Verify year count with true year count
cat("Number of years between 1700 - 2023 (inclusive): ", true_years, 
    "\nNumber of years accounted for out of 324 years: ", sunspot_data %>% nrow(),
    "\nCount of null values in the data set: ", sum(is.na(sunspot_data$MeanSunspotNumber)))
## Number of years between 1700 - 2023 (inclusive):  324 
## Number of years accounted for out of 324 years:  324 
## Count of null values in the data set:  0

Convert to TS Object

Converting the CSV (Comma-Separated Values) data to a ts() object in R enables efficient manipulation, exploration, and modeling of time-dependent data using functions and tools designed specifically for time series analysis.

Time series data inherently includes temporal information, such as time stamps or time intervals between observations. The conversion to a ts() object ensures that R can effectively recognize and leverage this temporal aspect of the data for conducting time-based analyses, thus avoiding the risk of overlooking or disregarding important temporal characteristics.

Note: The frequency for the observed series is set to 11 for two primary reasons. Firstly, the ACF plot of the raw series reveals periodic cycles occurring roughly every 11 years which is reported in the Appendix sections of the report. Secondly, detailed domain analysis of solar cycles corroborates this pattern, as each solar cycle is approximately 11 years in duration. The Determining Frequency section in the Appendix includes additional analysis for finding the frequency of the series.
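As a small illustration of how such a period can be checked (simulated data with a known 11-year cycle; the frequency analysis for the actual series is in the Determining Frequency appendix):

```r
# Sketch: estimate the dominant period of a series from its periodogram.
# The simulated series below has a built-in 11-year cycle plus noise.
set.seed(7)
t <- 1:324
x <- 50 + 40 * sin(2 * pi * t / 11) + rnorm(324, sd = 10)

# The periodogram peak frequency inverts to the cycle length
sp <- spec.pgram(ts(x), plot = FALSE)
dominant_period <- 1 / sp$freq[which.max(sp$spec)]
round(dominant_period)  # close to 11 for this simulated series
```

The same idea applied to the raw sunspot series supports the choice of frequency = 11.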

# Convert data to time series object
sunspot_TS <- sunspot_data$MeanSunspotNumber %>% ts(frequency = 11)
                                                    
# View first 20 years of the series
sunspot_TS %>% head(20)
## Time Series:
## Start = c(1, 1) 
## End = c(2, 9) 
## Frequency = 11 
##  [1]   8.3  18.3  26.7  38.3  60.0  96.7  48.3  33.3  16.7  13.3   5.0   0.0
## [13]   0.0   3.3  18.3  45.0  78.3 105.0 100.0  65.0

EDA

The Exploratory Data Analysis (EDA) section initiates the exploration and summarization of the time series data being analyzed. This involves verifying the data type of the time series object (sunspot_TS) to ensure its suitability for further analysis. Furthermore, summary statistics are provided to offer a comprehensive overview of the distribution and features of the mean sunspot numbers data.

Exploration

# Verify Data type 
sunspot_TS %>% class()
## [1] "ts"
# Summarize the time series object
sunspot_TS %>% summary()
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   24.57   65.55   78.53  115.47  269.30
# Create box plot to visualize sunspot number
sunspot_TS %>% boxplot(horizontal = TRUE, 
                       main = "Figure 1: Yearly Mean Sunspot Number (1700-2023)",
                       xlab = "Mean Sunspot Number", 
                       border = "black")
# Add grid
grid(nx = NULL, ny = NULL, col = "gray", lty = "dotted")

The summary statistics offer insight into the distribution of sunspot numbers from 1700 to 2023. The minimum value of 0.00 and the median of 65.55 indicate a near-even split: roughly half of the recorded yearly means fall below 65.55 and half above. These statistics provide a glimpse into the variability and range of mean sunspot numbers over the analyzed time frame.

Moreover, the box plot (Figure 1) demonstrates a relatively uniform distribution of sunspot observations with few outliers. Only 3 outliers are observed, corresponding to the largest mean sunspot numbers in the series.

Visualizing data

In this section, a time series plot is created to depict the trajectory of sunspots over time. The goal is to extract meaningful insights regarding the data’s characteristics, including trends, seasonality, variance changes, potential change points, and overall data behavior. This step is crucial for informing the model selection process and guiding decision-making effectively.

# Plot sunspot series across time
sunspot_TS %>% plot(type = 'o',
                    main = "Figure 2: Yearly Mean Sunspot Number (1700-2023)", 
                    xlab = "Year", 
                    ylab = "Mean Sunspot Number")

# Add grid
grid(nx = NULL, ny = NULL, col = "gray", lty = "dotted")

Plot Analysis

  1. Trend:

    The plot indicates that there is minimal trend present within the sunspot number data. This initial visual assessment suggests that the series exhibits stationarity, a crucial characteristic of time series data. The lack of a clear trend and the apparent stability of the statistical properties in the plot support the hypothesis that the series is stationary or trendless (based on visual assessment). However, we do acknowledge that there is a possibility that the underlying seasonality of the data might be masking the trend.

  2. Seasonality:

    The time series plot clearly exhibits seasonality in the data. Seasonality refers to patterns that repeat over specific periods, such as a year, quarter, or month. In this case, there is a distinct pattern of ups and downs recurring across the years, as evident in the graph. These regular fluctuations indicate that the data has a seasonal component, which is an important aspect to consider in time series analysis and forecasting. Recognizing seasonality helps in building more accurate models by accounting for these predictable changes.

  3. Changing Variance:

    Visual inspection of the time series reveals no severe changes in the variability of the sunspot observations. The fluctuations that do appear manifest as periodic bursts of high variability followed by intervals of noticeably lower variance, during which the data points show minimal dispersion. This pattern indicates that the sunspot observations cycle between phases of high and low activity, which is important for understanding the underlying dynamics of the series.

  4. Behavior:

    Due to the presence of seasonality, it is highly likely that the underlying pattern cannot be inferred solely from a visual assessment. The time series plot indicates strong and predominant autoregressive (AR) behavior, as most points exhibit a pattern where past observations influence future ones, suggesting a relationship between successive observations. However, there are minimal regions that display jumps and gaps, particularly near the peaks and tails of each cycle, indicating a possible moving average (MA) influence. This combination of AR behavior and occasional MA characteristics highlights the complex dynamics of the sunspot observations, where both past values and random shocks play a role in shaping the time series.

  5. Change/Intervention Point:

    A change point can be defined as a juncture in the series where there is an abrupt and significant alteration in either the series behavior or trend. Based on this definition, it could be argued that no distinct change point is discernible in the series. The overall behavior of the series remains relatively consistent, without any abrupt shifts or notable deviations that would indicate a significant change in the underlying pattern or trend. This suggests that the series maintains its general characteristics throughout the observed period.

Assessing Auto-Correlation

As observed in the previous section Visualizing Data, the series exhibits indications of autocorrelation. To further explore and gain deeper insights into the data, we will conduct an autocorrelation analysis by examining the correlation of the series with its first and second lag.

# Creating first and second lags for data
series <- sunspot_TS
first_lag <- zlag(sunspot_TS)
second_lag <- zlag(zlag(sunspot_TS))

# Setting index for correlation test
index1 <- 2:length(first_lag)
index2 <- 3:length(second_lag)

# Pearson's Correlation Test for first and second lag's
cor_first_lag <- cor(series[index1], first_lag[index1])
cor_second_lag <- cor(series[index2], second_lag[index2])

# Results of Pearson's correlation coefficient test
cat("Pearson's Correlation between series and first lag: ", cor_first_lag %>% round(4), 
    "\nPearson's Correlation between series and second lag: ", cor_second_lag %>% round(4))
## Pearson's Correlation between series and first lag:  0.8176 
## Pearson's Correlation between series and second lag:  0.438
# Set up a 1x2 layout for plots
par(mfrow=c(1,2))

# Visualization of correlation between series and first lag
first_lag[index1] %>% plot(series[index1],
                           xlab = "First Lag of Sunspot Series",
                           ylab = "Sunspot Series",
                           main = "Figure 3.1: Scatterplot of Sunspot Series and its First Lag")
text(x=40, y=260, col='blue',
     labels=paste0("Correlation value: ", cor_first_lag %>% round(4)))

# Add grid
grid(nx = NULL, ny = NULL, col = "gray", lty = "dotted")

# Visualization of correlation between series and second lag
second_lag[index2] %>% plot(series[index2],
                            xlab = "Second Lag of Sunspot Series",
                            ylab = "Sunspot Series",
                            main = "Figure 3.2: Scatterplot of Sunspot Series and its Second Lag")
text(x=40, y=260, col='blue',
     labels=paste0("Correlation value: ", cor_second_lag %>% round(4)))

# Add grid
grid(nx = NULL, ny = NULL, col = "gray", lty = "dotted")

The values obtained from the Pearson’s Correlation Test indicate strong positive correlations between the series and its first lag, and a moderately positive correlation with its second lag. Additionally, the scatter plots (Figure 3.1 & 3.2) visualize this correlation between the series and its first and second lags. Both plots depict a clear positive linear relationship, with the first lag showing a more dominant linear relationship. This suggests that the current values of the series are heavily influenced by their immediate past values, reinforcing the presence of auto-regressive behavior in the data.

Assessing Stationarity

From the visual assessment of the sunspot series in the time series plot (Figure 2), there was no indication of a clear trend being present in the series. However, to confirm this observation, this section will include further analysis using various tests and tools. Specifically, we will analyze the Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) plots, and conduct statistical tests for stationarity. These methods will provide a more rigorous evaluation of the series’ properties and help determine if the initial visual assessment holds true.

ACF and PACF plot

# Use utility script to plot ACF and PACF plot
sunspot_TS %>% plot_acf_pacf(acf_main = "Figure 4.1: Autocorrelation Function (ACF) of Yearly Sunspot Numbers",
                             pacf_main = "Figure 4.2: Partial Autocorrelation Function (PACF) of Yearly Sunspot Numbers",
                             max_lag = 70)

From the ACF plot (Figure 4.1), we do not observe a strict slowly decaying pattern, which is usually expected in a non-stationary series with a distinct trend. However, the ACF plot does show a decaying wave pattern, which strongly indicates seasonality or seasonal-trend in the series, aligning with our conclusion from the time series plot (Figure 2) that the series exhibits seasonality. The PACF plot (Figure 4.2) shows a very high first lag value, which could possibly indicate non-stationarity, but this high value could also be due to the underlying seasonality. Due to the lack of conclusive evidence, we need to employ statistical tests to confirm the presence or absence of stationarity in the series, providing a more definitive assessment.

Note: The above plots are generated using a utility script written specifically for this analysis to avoid repetitive code chunks. Please refer to the Appendix section for further details.
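For reference, a minimal sketch of what such a helper might look like (the actual plot_acf_pacf() implementation lives in utilities.R and is documented in the Appendix; the body below is an assumption):

```r
# Hypothetical sketch of the plot_acf_pacf() helper: side-by-side
# ACF and PACF plots of a series with custom titles and lag depth.
plot_acf_pacf <- function(series, acf_main, pacf_main, max_lag = 48) {
  par(mfrow = c(1, 2))                               # 1x2 plot layout
  acf(series, lag.max = max_lag, main = acf_main)    # autocorrelations
  pacf(series, lag.max = max_lag, main = pacf_main)  # partial autocorrelations
  par(mfrow = c(1, 1))                               # restore layout
  invisible(NULL)
}
```

In the report itself, the version sourced from utilities.R is used; this sketch is only illustrative.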

Statistical Tests for Stationarity

# Augmented Dickey-Fuller Test for stationarity
sunspot_TS %>% adf.test()
## 
##  Augmented Dickey-Fuller Test
## 
## data:  .
## Dickey-Fuller = -5.3319, Lag order = 6, p-value = 0.01
## alternative hypothesis: stationary
# Phillips-Perron Unit Root Test for stationarity
sunspot_TS %>% pp.test()
## 
##  Phillips-Perron Unit Root Test
## 
## data:  .
## Dickey-Fuller Z(alpha) = -83.347, Truncation lag parameter = 5, p-value
## = 0.01
## alternative hypothesis: stationary

Both the Augmented Dickey-Fuller Test and the Phillips-Perron Test report p-values of 0.01 (the smallest value these implementations print), well below the conventional threshold of 0.05, so we reject the null hypothesis of a unit root. Taken together with the time series plot (Figure 2) and the ACF plot (Figure 4.1), this collective evidence strongly indicates that the yearly sunspot series is stationary.

Assessing Normality

This section will investigate the presence of normality in the series, as normality is an underlying assumption of modeling methods like Maximum Likelihood Estimation (MLE), which will be employed in the modeling sections of this analysis. Assessing normality is crucial to ensure the validity of the modeling approach and to make accurate inferences from the data. To visually and statistically assess normality, we can use tools such as histograms, Q-Q plots, and normality tests like the Shapiro-Wilk test.

# Set up a 1x2 layout for plots
par(mfrow=c(1,2))

# Plotting a Histogram for sunspot_TS
sunspot_TS %>% hist(main = "Figure 5.1: Histogram of Sunspot Series",
                    xlab = "Sunspot Values")

# Add grid
grid(nx = NULL, ny = NULL, col = "gray", lty = "dotted")

# Plotting a QQ plot for sunspot_TS
sunspot_TS %>% qqnorm(main = "Figure 5.2: Normal QQ Plot of Sunspot Series")
sunspot_TS %>% qqline(col='blue', lty=2)

# Add grid
grid(nx = NULL, ny = NULL, col = "gray", lty = "dotted")

# Perform Shapiro-Wilks test for normality. 
rawTS_shapiro <- sunspot_TS %>% shapiro.test()
rawTS_shapiro
## 
##  Shapiro-Wilk normality test
## 
## data:  .
## W = 0.92106, p-value = 4.853e-12

The above histogram (Figure 5.1) does not show a clear normal distribution of the sunspot numbers. Similarly, the QQ-Plot (Figure 5.2) does not indicate normality, given the significant deviation of the observations from the reference line. The Shapiro-Wilk test further supports this, with a p-value of 4.853e-12, which is far below the conventional threshold of 0.05. This strongly suggests that the series is not normally distributed. Given this lack of normality, appropriate transformations of the data may need to be considered to improve the normality of the series.

Data Transformation

In order to improve the normality of the series, we will be using the Box-Cox transformation to find an optimal transformation for the series. This process will include determining an optimal lambda value and transforming the data based on this value.
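For reference, the Box-Cox family of power transformations is defined as:

```latex
\[
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\[6pt]
\log y, & \lambda = 0
\end{cases}
\]
```

Note that the transformation requires strictly positive data (\(y > 0\)), which motivates the small constant shift applied below, and that a \(\lambda\) near 0.5 corresponds roughly to a square-root transformation.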

Box-Cox Transformation

# Shift the series so all values are strictly positive (Box-Cox requires y > 0)
positive_sunspot_TS <- sunspot_TS + (abs(min(sunspot_TS)) + 0.01)

# Applying Box-Cox transformation
BC <- positive_sunspot_TS %>% BoxCox.ar(lambda = seq(-2, 2, 0.01), method = 'yule-walker')

# Add grid
grid(nx = NULL, ny = NULL, col = "gray", lty = "dotted")

# Add main title separately
title(main = 'Figure 6: Optimal Lambda value', adj = 0.5, line = 1.4)

# Use Log-likelihood estimation to find optimal lambda values
lambda <- BC$lambda[which(max(BC$loglike) == BC$loglike)]

# Print results
cat("Lower bound Confidence Interval: ", BC$ci[1],
    "\nUpper bound Confidence Interval: ", BC$ci[2],
    "\n\nOptimal Lambda value: ", lambda)
## Lower bound Confidence Interval:  0.44 
## Upper bound Confidence Interval:  0.56 
## 
## Optimal Lambda value:  0.49

The optimal lambda value for the Box-Cox transformation is determined using log-likelihood estimation. The Box-Cox transformation function generates a range of lambda values from -2 to 2 with an increment of 0.01. For each lambda value, the log-likelihood of the transformed data is computed. The lambda value corresponding to the highest log-likelihood is considered optimal as it indicates the transformation that best fits the data.

Regarding the confidence intervals, they provide insights into the reliability of the transformation. The confidence intervals, typically derived from statistical tests, indicate a range of lambda values within which the transformation is deemed statistically valid or effective.

These values suggest that lambda values between approximately 0.44 and 0.56 are likely to produce meaningful transformations of the data. Therefore, for the subsequent analysis phases, the optimal lambda value of approximately 0.49 should be considered for applying the Box-Cox transformation.

# Set up a 3x2 layout for plots
par(mfrow=c(3,2))

# Plot original sunspot series across time
sunspot_TS %>% plot(type = 'o',
                    main = "Figure 7.1: Original Sunspot series (1700-2023)", 
                    xlab = "Year", 
                    ylab = "Mean Sunspot Number")

# Add grid
grid(nx = NULL, ny = NULL, col = "gray", lty = "dotted")

# Apply the Box-Cox transformation and plot the transformed series across time
BC_sunspot_TS <- ((positive_sunspot_TS^lambda) - 1) / lambda

BC_sunspot_TS %>% plot(type = 'o',
                       main = 'Figure 7.2: Box-Cox Transformed Sunspot series (1700-2023)',
                       xlab = 'Year',
                       ylab = 'Transformed Mean Sunspot Number')

# Add grid
grid(nx = NULL, ny = NULL, col = "gray", lty = "dotted")

# Test for normality with Box-Cox transformed series.
BC_TS_Stest <- BC_sunspot_TS %>% shapiro.test()

# Plotting a Histogram for sunspot_TS
sunspot_TS %>% hist(main = "Figure 7.3: Histogram of Original Sunspot Series",
                    xlab = "Sunspot Values")

# Plotting a Histogram for BC_sunspot_TS
BC_sunspot_TS %>% hist(main = "Figure 7.4: Histogram of Transformed Sunspot Series",
                       xlab = "Transformed Sunspot Values")

# Plotting a QQ plot for sunspot_TS
sunspot_TS %>% qqnorm(main = "Figure 7.5: Normal QQ Plot of Original Sunspot series")
sunspot_TS %>% qqline(col='blue', lty=2)
text(x = 1.4, y = 3, 
     labels = paste0("Shapiro-Wilks (p-value): ", rawTS_shapiro[2]),
     col='blue', cex=1.5)
grid(nx = NULL, ny = NULL, col = "gray", lty = "dotted")

# QQ-Plot Box-Cox transformed series
BC_sunspot_TS %>% qqnorm(main = "Figure 7.6: Normal QQ Plot of Transformed Sunspot series")
BC_sunspot_TS %>% qqline(lty=2, col='blue')
text(x = 1.2, y = -1.7, 
     labels = paste0("Shapiro-Wilks (p-value): ", BC_TS_Stest[2]),
     col='blue', cex=1.5)
grid(nx = NULL, ny = NULL, col = "gray", lty = "dotted")

Although the Box-Cox transformation does not fully achieve normality, it does noticeably improve the normality of the series. The improvement can be observed in the QQ-Plots of the original and transformed series (Figures 7.5 & 7.6) and in the corresponding histograms (Figures 7.3 & 7.4). Initially, the Shapiro-Wilk test produced a p-value of 4.853e-12, which is extremely low and indicates a strong deviation from normality.

After applying the Box-Cox transformation, the Shapiro-Wilk p-value increased to 0.0004. This substantial increase in the p-value reflects a notable enhancement in the normality of the series, making it more suitable for modeling methods that assume normality, such as Maximum Likelihood Estimation (MLE). Given this information we will proceed using the BC_sunspot_TS for the successive part of the analysis.

Note: We do not conduct a stationarity test on the transformed series because we expect the stationarity to remain unaffected by the transformation. The transformation is performed primarily for two reasons: first, to address any variability in our series, and second, to improve the normality of the series. Since the series is already stationary and no ordinary differencing is performed, the stationarity will remain unchanged. However, the appendix section of the report contains additional evidence to confirm this.

Model Specification

Through the initial exploratory data analysis, we identified a strong seasonal component in the series. The ACF and PACF plots of the non-transformed data further supported this observation. Given the presence of seasonality, we have several options for modeling the sunspot number series. For this analysis, we will proceed with a Seasonal AutoRegressive Integrated Moving Average (SARIMA) modeling approach. This approach allows us to account for both the seasonal and non-seasonal patterns in the data, providing a comprehensive framework for accurate modeling and forecasting of the sunspot numbers.

The SARIMA model can be expressed as follows: \(\text{SARIMA}(p,d,q)(P,D,Q)_s\)

where \(p\), \(d\), and \(q\) denote the non-seasonal autoregressive (AR), differencing, and moving average (MA) orders; \(P\), \(D\), and \(Q\) denote the corresponding seasonal orders; and \(s\) is the length of the seasonal period (here, 11 years).
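In backshift-operator notation, the model can be written as:

```latex
\[
\Phi_P(B^{s})\,\phi_p(B)\,(1 - B)^{d}\,(1 - B^{s})^{D}\, Y_t
  = \Theta_Q(B^{s})\,\theta_q(B)\, e_t
\]
```

Here \(B\) is the backshift operator (\(B Y_t = Y_{t-1}\)), \(\phi_p\) and \(\theta_q\) are the non-seasonal AR and MA polynomials, \(\Phi_P\) and \(\Theta_Q\) their seasonal counterparts in \(B^{s}\), and \(e_t\) is white noise.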

Exclusion of Trend Models: Rationale

In the analysis of the sunspot number series, we have opted not to use trend models for several reasons. Instead, we chose the Seasonal AutoRegressive Integrated Moving Average (SARIMA) model, which is better suited for capturing the underlying patterns in the data.

Through initial exploratory data analysis and visual inspection of the time series plot, it became evident that the sunspot numbers exhibit a strong seasonal pattern. The Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) plots of the non-transformed data further confirmed the presence of seasonality, displaying a decaying wave pattern typical of seasonal data. In contrast, trend models primarily focus on capturing long-term trends and are not designed to account for seasonality effectively.

The sunspot series is characterized by periodic fluctuations of approximately 11 years; more precisely, these cycles range from about 9 to 14 years, so there is natural variability that requires a model able to explicitly account for varying cycles. Trend models provide no mechanism to capture such cyclical patterns, leading to inadequate representations and potentially misleading forecasts.

The SARIMA model is specifically designed to handle time series data with both seasonal and non-seasonal components. It incorporates terms to address autoregressive behavior, moving averages, and differencing for both seasonal and non-seasonal parts of the series. This makes it a comprehensive and flexible approach for modeling the complex dynamics observed in the sunspot data. By using SARIMA, we can accurately capture the periodicity, autoregressive influences, and any random shocks present in the data.

Addressing Seasonality

In this section, we aim to determine the optimal seasonal orders for the SARIMA model, which include the seasonal autoregressive (AR) order (P), seasonal differencing (D), seasonal moving average (MA) order (Q), and the periodicity (s). Accurately identifying these parameters is crucial for capturing the seasonal patterns in the time series data.

To achieve this, we will employ a residuals-based approach. This method involves fitting a series of SARIMA models with different combinations of seasonal orders to the time series data. Each model will be evaluated based on how well it captures the underlying seasonal patterns. After fitting each model, we will analyze the residuals for any signs of remaining seasonality. Ideally, the residuals will not show any seasonal pattern, indicating that the model has successfully captured the seasonal structure of the data.

We will use tools such as Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) plots to examine the residuals. Through this iterative process, we aim to select the most appropriate seasonal orders that provide the best fit for the data. This ensures that the final SARIMA model is robust and capable of accurately capturing the seasonal dynamics present in the time series.
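A minimal sketch of the residual-checking helper used below (the actual check_residuals() implementation is in utilities.R and the Appendix; the body and argument handling here are assumptions):

```r
# Hypothetical sketch of the check_residuals() helper: fit a seasonal
# ARIMA, then plot the residual series, its histogram, ACF, and PACF.
check_residuals <- function(series, order, seasonal_order, period,
                            acf_lag = 48, pacf_lag = 48,
                            res_main = "", hist_main = "",
                            acf_main = "", pacf_main = "") {
  fit <- arima(series, order = order,
               seasonal = list(order = seasonal_order, period = period))
  res <- residuals(fit)
  par(mfrow = c(2, 2))                            # 2x2 diagnostic layout
  plot(res, type = "o", main = res_main)          # residuals over time
  hist(res, main = hist_main)                     # residual distribution
  acf(res, lag.max = acf_lag, main = acf_main)    # remaining autocorrelation
  pacf(res, lag.max = pacf_lag, main = pacf_main) # remaining partial autocorr.
  par(mfrow = c(1, 1))
  invisible(fit)
}
```

The report uses the version sourced from utilities.R; this sketch only illustrates the diagnostic workflow.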

# Plot ACF and PACF plots BC_sunspot_TS
BC_sunspot_TS %>% 
  plot_acf_pacf(acf_main = "Figure 8.1: Autocorrelation Function (ACF) of Transformed Sunspot Numbers",
                pacf_main = "Figure 8.2: Partial Autocorrelation Function (PACF) of Transformed Sunspot Numbers",
                max_lag = 60)

The ACF of the transformed sunspot series shows that most of the seasonal peaks occur at multiples of the 11-year period, with only the 4th and 5th seasonal lags showing weaker, less pronounced peaks.

Seasonal Differencing (D)

In this section, we will perform seasonal differencing on the time series data using a SARIMA(0,0,0)(0,1,0)[11] model. This model has all parameters set to zero except for the seasonal differencing order (D), which is set to 1. The purpose of this configuration is to remove any seasonal pattern with a periodicity of 11 units from the data.

By setting the seasonal differencing component (D) to 1, we expect the residuals of the model to show no remaining seasonal patterns, indicating that the seasonal effects have been effectively removed. This process helps in isolating the underlying trends and irregular components of the time series.

It is crucial to note that setting higher orders for the seasonal differencing parameter (D) can lead to data loss and may result in over-differencing. Over-differencing can distort the data and obscure meaningful patterns, making the time series analysis less reliable. Therefore, careful consideration is required when selecting the order of differencing to ensure that the seasonal effects are adequately removed without compromising the integrity of the data.
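As a tiny illustration of what seasonal differencing at lag 11 does (toy data, using the standard diff() behavior):

```r
# Seasonal differencing at lag 11: each value becomes Y_t - Y_{t-11},
# removing any pattern that repeats every 11 observations.
x <- ts(1:33, frequency = 11)   # toy series increasing by 1 each step
d <- diff(x, lag = 11)

length(d)               # 33 - 11 = 22 observations remain
unique(as.numeric(d))   # every difference equals 11 for this toy series
```

Note that each seasonal difference discards one full cycle (11 observations) from the start of the series, which is why higher orders of D quickly become costly.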

# Checking residuals of seasonally differenced model - SARIMA(0,0,0)(0,1,0)[11]
BC_sunspot_TS %>% check_residuals(order = c(0,0,0), 
                                  seasonal_order = c(0,1,0),
                                  period = 11,
                                  acf_lag = 70,
                                  pacf_lag = 70,
                                  res_main = 'Figure 9.1: Residuals of seasonally differenced model SARIMA(0,0,0)(0,1,0)[11]',
                                  hist_main = 'Figure 9.2: Histogram of residuals',
                                  acf_main = 'Figure 9.3: ACF plot of residuals',
                                  pacf_main = 'Figure 9.4: PACF plot of residuals')

We observe that seasonality in the series is significantly reduced, as evidenced by the residuals, which exhibit minimal signs of seasonality. However, it’s important to note that not all seasonal components have been fully addressed. This is indicated by the Autocorrelation Function (ACF) and Partial Autocorrelation Function (PACF) plots, which show significant autocorrelation at the seasonal lags. To address this remaining seasonality, we need to determine the appropriate seasonal Autoregressive (AR) and Moving Average (MA) orders for the series.

From the ACF plot, we observe significant spikes at the first and second seasonal lags, and the fifth seasonal lag is near-significant. Given this, the potential orders for the seasonal MA component (Q) are likely to be 2 or 3, the latter incorporating the fifth lag.

The PACF plot suggests that the seasonal AR order (P) is likely to be 0 or 1. This inference is based on the behavior of the PACF, which decays rapidly after the first seasonal lag.

Seasonal Orders (P, Q)

# Checking residuals of seasonally differenced model with P and Q orders - SARIMA(0,0,0)(1,1,2)[11]
BC_sunspot_TS %>% check_residuals(order = c(0,0,0), 
                                  seasonal_order = c(1,1,2),
                                  period = 11,
                                  acf_lag = 70,
                                  pacf_lag = 70,
                                  res_main = 'Figure 10.1: Residuals of seasonally differenced model SARIMA(0,0,0)(1,1,2)[11]',
                                  hist_main = 'Figure 10.2: Histogram of residuals',
                                  acf_main = 'Figure 10.3: ACF plot of residuals',
                                  pacf_main = 'Figure 10.4: PACF plot of residuals')

We can observe from the ACF (Figure 10.3) and PACF (Figure 10.4) plots that there are no more significant spikes at the seasonal lags. The residuals plot also shows a significant reduction in seasonality compared with the original series. After experimenting with the possible combinations of P and Q derived in the previous section, the SARIMA(0,0,0)(1,1,2)[11] model shows the best results in addressing the seasonal component underlying the data.

Given this information, we set the seasonal orders (P,D,Q) to (1,1,2).

Appendix

Determining Frequency

Visual Assessment of ACF

Residuals Check for Frequency

BC_sunspot_TS %>% diff(lag=11) %>%
  ggtsdisplay(xlab="Year",
    main="Seasonally differenced sunspot series")

model <- Arima(BC_sunspot_TS, order=c(0,0,0), seasonal=list(order=c(1,1,2), period=11))

model %>% coeftest()
## 
## z test of coefficients:
## 
##       Estimate Std. Error z value  Pr(>|z|)    
## sar1  0.447910   0.096608  4.6364 3.545e-06 ***
## sma1 -0.822370   0.109741 -7.4938 6.693e-14 ***
## sma2 -0.177593   0.102031 -1.7406   0.08176 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
model %>% summary()
## Series: BC_sunspot_TS 
## ARIMA(0,0,0)(1,1,2)[11] 
## 
## Coefficients:
##         sar1     sma1     sma2
##       0.4479  -0.8224  -0.1776
## s.e.  0.0966   0.1097   0.1020
## 
## sigma^2 = 23.96:  log likelihood = -952.59
## AIC=1913.18   AICc=1913.31   BIC=1928.16
## 
## Training set error measures:
##                     ME     RMSE      MAE       MPE     MAPE      MASE      ACF1
## Training set 0.6016577 4.787958 3.570949 -14.86451 45.60751 0.8670863 0.7619912
checkresiduals(model, lag=30)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,0,0)(1,1,2)[11]
## Q* = 331.22, df = 27, p-value < 2.2e-16
## 
## Model df: 3.   Total lags used: 30